Exploratory Data Analysis - White Wine

by Stephen Griffiths

Introduction

In this project, exploratory data analysis techniques are applied to a dataset containing the physical and chemical properties and perceived qualities of a large number of Portuguese ‘Vinho Verde’ white wines. The relationships among multiple variables are explored in order to ascertain which properties influence the quality of these white wines.

Dataset

The dataset is available from:

The dataset contains 4,898 white wines with 11 variables quantifying the physical and chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The dataset contains the following attributes:

Input variables (based on physicochemical tests):

1 - fixed acidity (tartaric acid - \(g/dm^3\)) - most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

2 - volatile acidity (acetic acid - \(g/dm^3\)) - the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste.

3 - citric acid (\(g/dm^3\)) - found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

4 - residual sugar (\(g/dm^3\)) - the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 \(g/dm^3\), and wines with greater than 45 \(g/dm^3\) are considered sweet.

5 - chlorides (sodium chloride - \(g/dm^3\)) - the amount of salt in the wine.

6 - free sulfur dioxide (\(mg/dm^3\)) - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

7 - total sulfur dioxide (\(mg/dm^3\)) - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

8 - density (\(g/cm^3\)) - the density of wine is close to that of water depending on the percent alcohol and sugar content.

9 - pH - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.

10 - sulphates (potassium sulphate (K2SO4) - \(g/dm^3\)) - a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

11 - alcohol (% by volume) - the percent alcohol content of the wine.

Output variable (based on sensory data):

12 - quality (score between 0 and 10) - median of at least 3 evaluations made by wine experts.

Data Structure

We must first examine the structure of our dataset:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
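The loading code is not shown in this report, and the output above appears to come from R. As a rough Python sketch only, the table could be parsed as follows. The inline CSV text is a hypothetical stand-in built from the first three rows printed above (with values as rounded in that printout), and `load_wines` is an illustrative name. Note that `X` is just a row index, which leaves the 11 input variables plus `quality`.

```python
import csv
import io

# Hypothetical stand-in for the real file: first three rows as printed above.
SAMPLE_CSV = """X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
2,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
3,8.1,0.28,0.4,6.9,0.05,30,97,0.995,3.26,0.44,10.1,6
"""

def load_wines(text):
    """Parse CSV text into a list of dicts, dropping the row-index column X."""
    wines = []
    for row in csv.DictReader(io.StringIO(text)):
        row.pop("X")  # row index only, not a property of the wine
        wine = {name: float(value) for name, value in row.items()}
        wine["quality"] = int(wine["quality"])  # quality is an integer score
        wines.append(wine)
    return wines

wines = load_wines(SAMPLE_CSV)
print(len(wines), "wines,", len(wines[0]), "variables each")
```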

Univariate Plots Section

To get an initial feel for the data we will plot the distributions:

Quality:

Quality is our main feature of interest. We will take a look at the statistics:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Although there are no bad wines (rating 0-2) and no excellent wines (rating 10), for completeness we define a categorical variable for the quality based on the following:

Poor (0 to 3)

Average (4 to 6)

Good (7 to 10)

The dataset contains the following number of entries for each category:

## 
##    Poor Average    Good 
##      20    3818    1060
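The binning rule above can be sketched in a few lines of Python. This is an illustration rather than the code used for the report: `quality_category` is a hypothetical helper, and the per-rating counts are the ones tabulated in the Final Plots and Summary section.

```python
# Number of wines at each quality rating, as reported in the
# "Final Plots and Summary" section of this report.
RATING_COUNTS = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

def quality_category(rating):
    """Map a 0-10 quality rating to Poor (0-3), Average (4-6) or Good (7-10)."""
    if not 0 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 3:
        return "Poor"
    if rating <= 6:
        return "Average"
    return "Good"

category_counts = {"Poor": 0, "Average": 0, "Good": 0}
for rating, count in RATING_COUNTS.items():
    category_counts[quality_category(rating)] += count

print(category_counts)
```

Summing the per-rating counts this way reproduces the 20 / 3818 / 1060 split shown above.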

Quality follows a near normal distribution with the majority of wines falling in the higher end of the average category.


Many of the variables have skewed distributions with significant outliers. We will look at these in more detail.

Acidity:

Fixed acidity has a normal distribution with some outliers so we will clip the outer 10% of values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The ‘striped’ appearance is due to the data being recorded to 1 decimal place, with the exception of values at 6.15 \(g/dm^3\), 6.45 \(g/dm^3\) and 7.15 \(g/dm^3\).
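The report does not say exactly how the "outer 10%" is removed. One plausible reading, sketched below in Python, is to hide 5% of the values in each tail by setting the axis limits to the 5th and 95th percentiles. The even split between tails, the function names and the toy data are all assumptions made for illustration.

```python
import math

def quantile(values, p):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def clip_limits(values, outer_fraction):
    """Axis limits that hide `outer_fraction` of the data, split evenly
    between the two tails (an assumed reading of "outer 10%")."""
    tail = outer_fraction / 2
    return quantile(values, tail), quantile(values, 1 - tail)

# Toy data, not from the wine dataset: the integers 1..100.
toy = list(range(1, 101))
low, high = clip_limits(toy, 0.10)
shown = [v for v in toy if low <= v <= high]
print(low, high, len(shown))
```

Clipping the axis in this way only changes what is displayed; the summary statistics are still computed on the full data.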

Volatile acidity has a slightly skewed distribution so we will use a log transformation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The distribution has the same ‘striped’ appearance as fixed acidity below the mean, for the same reasons as stated earlier. The median (0.2600 \(g/dm^3\)) is less than the mean (0.2782 \(g/dm^3\)), highlighting the slight positive skew.
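To see why a log transformation helps here, the short Python sketch below compares the length of the upper tail with the lower tail, before and after taking base-10 logs. It uses only the minimum, median and maximum reported above; the choice of base 10 is an assumption, since the report does not state which logarithm was used.

```python
import math

# Minimum, median and maximum of volatile acidity (g/dm^3) as reported above.
minimum, median, maximum = 0.08, 0.26, 1.10

# Length of the upper tail relative to the lower tail on the raw scale.
raw_ratio = (maximum - median) / (median - minimum)

# The same comparison after a base-10 log transformation.
log_ratio = (math.log10(maximum) - math.log10(median)) / (
    math.log10(median) - math.log10(minimum)
)

print(f"raw scale: upper tail is {raw_ratio:.2f}x the lower tail")
print(f"log scale: upper tail is {log_ratio:.2f}x the lower tail")
```

On the raw scale the upper tail is several times longer than the lower tail, while on the log scale the two are close to balanced, which is what makes the transformed histogram easier to read.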

Citric acid appears normally distributed with some positive outliers so we will clip the outer 10% of values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

There is a slight positive skew with a median (0.3200 \(g/dm^3\)) slightly less than the mean (0.3342 \(g/dm^3\)). There are also a number of significant spikes in the data, for example at 0.49 \(g/dm^3\) and 0.74 \(g/dm^3\).

pH appears to have a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Sugar:

Residual sugar is quite positively skewed with significant outliers so we will use a log transformation and clip the outer 2% of values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual sugar shows an overall bi-modal distribution with a significant peak around 1.5 \(g/dm^3\) and multiple peaks centered around 10 \(g/dm^3\). According to the limits described earlier, there are 77 wines that are considered unsweet and 1 wine that is considered sweet, with 65.8 \(g/dm^3\) of residual sugar.

Salts:

Chlorides appear to have a nearly normal distribution with many positive outliers so we will clip the outer 10% of values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Further exploration reveals chlorides to have a slight positive skew with the median (0.04300 \(g/dm^3\)) less than the mean (0.04577 \(g/dm^3\)). There is also a large range (0.009 \(g/dm^3\) - 0.346 \(g/dm^3\)), with many positive outliers.

Sulfur:

Free sulfur dioxide (SO2) follows a normal distribution with positive outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Total sulfur dioxide (SO2) also follows a normal distribution with positive outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Potassium sulphate (K2SO4) has a slightly skewed distribution so we will use a log transformation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The distribution shows multiple significant peaks.

Density:

Density appears to have a nearly normal distribution with positive outliers so we will clip the outer 20% of values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density has a slight positive skew with the median (0.9937 \(g/cm^3\)) less than the mean (0.9940 \(g/cm^3\)). There are also several significant peaks.

Alcohol:

Alcohol is positively skewed so we will use a log transformation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol is slightly positively skewed with a mean of 10.51% and a median of 10.40%. The most common wines have about 9.25% alcohol.


Univariate Analysis

What is the structure of your dataset?

The dataset contains 4,898 white wines with 11 variables quantifying the physicochemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The structure of the data is as described above.

What is/are the main feature(s) of interest in your dataset?

The objective of the analysis is to explore any relationships that may exist between the different physicochemical properties of each wine. With perceived quality being the main feature of interest, we will also look at which properties are most influential on wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point we will look at all the features in the dataset. Without more in-depth exploration we cannot be certain which properties are most influential on wine quality.

Did you create any new variables from existing variables in the dataset?

We defined a categorical variable for the wine quality based on the perceived quality rating.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Volatile acidity, residual sugar, sulphates and alcohol all exhibited skewed distributions, so we applied log transformations in order to better visualise the data. Several of the features also contained outliers, so we clipped the chart axes to exclude these values. Residual sugar followed a bi-modal distribution, hinting at two distinct groups. Several of the features also contained significant peaks at various points in their distributions.


Bivariate Plots Section

We will first calculate the correlations between each of the features.

The correlation plot shows quality is positively correlated with alcohol, and negatively correlated with density. There are other strong correlations that are evident between the different properties so we will examine the strongest of these also.

There is a definite trend of increased perceived quality with increasing alcohol content.

The above plot seems to show that as density decreases, the perceived quality increases.

Strongest Correlations:

## 
##  Pearson's product-moment correlation
## 
## data:  density and residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

As residual sugar content increases, so does the density of the wine. The above plot also hints at two distinct groups of wine based on residual sugar content.
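The t statistic and confidence interval in the output above follow directly from the correlation coefficient and the sample size. The Python sketch below recomputes them using the standard formulas, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) and a Fisher z interval. It is a check on the reported numbers, not the code that produced them, and `pearson_t_and_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def pearson_t_and_ci(r, n, confidence=0.95):
    """t statistic and Fisher-z confidence interval for a Pearson r
    computed from n paired observations."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    z = math.atanh(r)                      # Fisher transformation
    se = 1 / math.sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return t, df, (math.tanh(z - crit * se), math.tanh(z + crit * se))

# Values reported above for density vs residual sugar.
t, df, (ci_low, ci_high) = pearson_t_and_ci(0.8389665, 4898)
print(f"t = {t:.2f}, df = {df}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

With nearly 5,000 observations the interval is narrow, so even the weaker correlations reported in this section are estimated quite precisely.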

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

As total sulfur dioxide increases, so does free sulfur dioxide. This is expected since free sulfur dioxide is a subset of total sulfur dioxide.
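Because total SO2 is the sum of the free and bound forms, the free value should never exceed the total for any wine. A quick sanity check, sketched in Python on the first ten values of each column printed in the Data Structure section (not on the full dataset), looks like this:

```python
# First ten values of each column as printed in the Data Structure section (mg/dm^3).
free_so2 = [45, 14, 30, 47, 47, 30, 30, 45, 14, 28]
total_so2 = [170, 132, 97, 186, 186, 97, 136, 170, 132, 129]

# Total SO2 is free plus bound SO2, so free should never exceed total.
violations = [(f, t) for f, t in zip(free_so2, total_so2) if f > t]

# Share of the total that is in the free form, per wine.
free_share = [f / t for f, t in zip(free_so2, total_so2)]

print("rows where free > total:", len(violations))
print("free share ranges from", round(min(free_share), 3), "to", round(max(free_share), 3))
```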

## 
##  Pearson's product-moment correlation
## 
## data:  density and total.sulfur.dioxide
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5094349 0.5497297
## sample estimates:
##       cor 
## 0.5298813

The higher the total sulfur dioxide content, the higher the density of the wine.

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and residual.sugar
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3776791 0.4246712
## sample estimates:
##       cor 
## 0.4014393

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

There is a strong correlation here. As the alcohol level increases, the density of the wine decreases.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

As the alcohol level increases, the residual sugar content of the wine also decreases.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and total.sulfur.dioxide
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4709775 -0.4262443
## sample estimates:
##        cor 
## -0.4488921

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and chlorides
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3843183 -0.3355673
## sample estimates:
##        cor 
## -0.3601887

## 
##  Pearson's product-moment correlation
## 
## data:  pH and fixed.acidity
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4485154 -0.4026542
## sample estimates:
##        cor 
## -0.4258583

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main feature of interest, quality, was found to be positively correlated with alcohol, and negatively correlated with density. Perceived quality tends to increase with increasing alcohol content and decreasing density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The most highly correlated features were examined and some interesting relationships were found. As the residual sugar content increases the density also increases. This is expected because any remaining sugar after fermentation will dissolve in the wine, therefore adding to the density. Density was also found to decrease with increasing alcohol level. This can be explained by the fact that during fermentation, sugar is turned into alcohol, so the higher the alcohol level the less residual sugar, and hence less density.

What was the strongest relationship you found?

The strongest relationship was residual sugar vs density, with a Pearson correlation coefficient of 0.8389665. This was closely followed by density vs alcohol, with a Pearson correlation coefficient of -0.7801376.
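The ranking can be checked by sorting the reported coefficients by absolute value, since strength does not depend on sign. The Python sketch below uses only the coefficients printed in the Bivariate Plots Section; the squared value is included as a rough guide to shared variance.

```python
# Pearson coefficients as reported in the Bivariate Plots Section.
reported = {
    ("density", "residual.sugar"): 0.8389665,
    ("total.sulfur.dioxide", "free.sulfur.dioxide"): 0.615501,
    ("density", "total.sulfur.dioxide"): 0.5298813,
    ("total.sulfur.dioxide", "residual.sugar"): 0.4014393,
    ("alcohol", "density"): -0.7801376,
    ("alcohol", "residual.sugar"): -0.4506312,
    ("alcohol", "total.sulfur.dioxide"): -0.4488921,
    ("alcohol", "chlorides"): -0.3601887,
    ("pH", "fixed.acidity"): -0.4258583,
}

# Rank by strength (absolute value), ignoring the sign of the relationship.
ranked = sorted(reported.items(), key=lambda item: abs(item[1]), reverse=True)

for (a, b), r in ranked:
    print(f"{a} vs {b}: r = {r:+.4f}, r^2 = {r * r:.3f}")
```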


Multivariate Plots Section

We will look at the three most highly correlated pairs of features and examine their effect on perceived wine quality.

The trends are similar for all three categories of wine quality. The plot shows that for any given density, the higher the residual sugar content the higher the perceived quality.

Again the trends are similar for all three categories of wine quality. Generally for any given density, the higher the alcohol content the higher the perceived quality.

While the correlation for the above features is quite high (0.615501), it is difficult to determine the effect on perceived wine quality from the above plot.


Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Quality is closely affected by alcohol content, which is in turn closely affected by both density and residual sugar. For any given density, the higher the residual sugar content the higher the perceived quality. However, this trend is masked by the stronger relationship between density and quality, since residual sugar adds to the density. The higher quality wines tend to have higher alcohol levels and lower densities.

Were there any interesting or surprising interactions between features?

Free sulfur dioxide correlated relatively highly with total sulfur dioxide, with a Pearson correlation coefficient of 0.615501. However, when plotted and coloured by quality category, it was difficult to determine any effect on perceived quality from the plot.


Final Plots and Summary

Plot One

Description One

Quality follows a near normal distribution with the majority of wines falling in the higher end of the average category. Since there are only a small number of the highest and lowest ratings, what are the actual counts for each rating?

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
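These counts are enough to recover the summary statistics quoted for quality in the Univariate Plots Section. The Python sketch below does so as a consistency check; `rating_at_fraction` is a hypothetical helper that uses a simple cumulative-count rule, which may differ from the exact quantile definition used for the report but gives the same answer for this integer-valued variable.

```python
# Counts per quality rating, as tabulated above.
rating_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

total = sum(rating_counts.values())
mean = sum(rating * count for rating, count in rating_counts.items()) / total

def rating_at_fraction(counts, fraction):
    """Lowest rating whose cumulative share of wines reaches `fraction`."""
    threshold = fraction * sum(counts.values())
    running = 0
    for rating in sorted(counts):
        running += counts[rating]
        if running >= threshold:
            return rating
    raise ValueError("fraction must be between 0 and 1")

quartiles = [rating_at_fraction(rating_counts, f) for f in (0.25, 0.5, 0.75)]
print(total, round(mean, 3), quartiles)
```

Ratings 5 and 6 together account for roughly three quarters of the wines, which is why the median and third quartile coincide.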

Plot Two

We shall take the plot of density vs residual sugar by quality category and break it by quality rating to better visualise each contribution:

Description Two

The trends are similar for all three categories of wine quality. The plots show that for any given density, the higher the residual sugar content the higher the perceived quality. While the overall correlation coefficient is 0.8389665, what are the correlation coefficients for each quality category?

## wine$quality_category: Poor
## [1] 0.708568
## -------------------------------------------------------- 
## wine$quality_category: Average
## [1] 0.8543445
## -------------------------------------------------------- 
## wine$quality_category: Good
## [1] 0.820208
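The per-category coefficients above come from splitting the wines by quality category and computing a separate Pearson correlation within each group. A minimal Python sketch of that split-and-correlate step is shown below. The rows in `toy_records` are made up purely for illustration and do not come from the wine dataset, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_group(records):
    """records: iterable of (group, x, y); returns {group: Pearson r}."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in records:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {group: pearson(xs, ys) for group, (xs, ys) in grouped.items()}

# Made-up (category, density, residual sugar) rows, for illustration only.
toy_records = [
    ("Average", 0.990, 1.0), ("Average", 0.995, 5.0), ("Average", 1.000, 9.0),
    ("Good", 0.990, 2.0), ("Good", 0.994, 4.0), ("Good", 0.998, 8.0),
]

by_category = correlation_by_group(toy_records)
for category, r in by_category.items():
    print(category, round(r, 4))
```

Note that the Poor category contains only 20 wines, so its coefficient is estimated far less precisely than those of the other two categories.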

Plot Three

We shall take the plot of alcohol vs density by quality category and break it by quality rating to better visualise each contribution:

Description Three

Again the trends are similar for all three categories of wine quality. Generally for any given density, the higher the alcohol content the higher the perceived quality. While the overall correlation coefficient is -0.7801376, what are the correlation coefficients for each quality category?

## wine$quality_category: Poor
## [1] -0.6135293
## -------------------------------------------------------- 
## wine$quality_category: Average
## [1] -0.7357538
## -------------------------------------------------------- 
## wine$quality_category: Good
## [1] -0.8436137

Reflection

A dataset of 4,898 Portuguese ‘Vinho Verde’ white wines was explored using exploratory data analysis. The dataset contained 11 variables quantifying some of the physicochemical properties, together with an expert quality rating of each wine. The objective of the analysis was to explore any relationships between the different physicochemical properties of each wine, and their influence on perceived wine quality.

After examining the structure of the dataset, a categorical variable for the quality was defined and the histograms of each variable were plotted to see if there were any unusual distributions. Some of the variables had skewed distributions with significant outliers, so they were transformed and clipped to enable better visualisation. The most striking distribution was that of residual sugar, which was found to be bi-modal.

The relationships between each of the features were calculated using the Pearson correlation coefficients. The correlation plot showed quality was positively correlated with alcohol, and negatively correlated with density. The most highly correlated features were examined and some interesting relationships were found. Quality was closely affected by alcohol content, which itself was closely affected by both density and residual sugar. The higher quality wines tend to have higher alcohol levels and lower densities. Free sulfur dioxide correlated relatively highly with total sulfur dioxide, but when plotted with quality it was difficult to determine any effect on perceived quality from the plot.

The dataset is limited in that it only contains samples from a specific region of Northern Portugal and does not detail the grapes used. There are many different grapes used in Vinho Verde white wines. This could have a huge influence on quality. The dataset also has severe limitations in terms of the number of datapoints at the extremes of quality rating.

For the future, a predictive model could be created to predict wine quality based on supplied properties, although I feel data from a wider quality range is needed for this.
